Kernel-Based Reinforcement Learning Using Bellman Residual Elimination
Abstract
This paper presents a class of new approximate policy iteration algorithms for solving infinite-horizon, discounted Markov decision processes (MDPs) for which a model of the system is available. The algorithms are similar in spirit to Bellman residual minimization methods. However, by exploiting kernel-based regression techniques with nondegenerate kernel functions as the underlying cost-to-go function approximation architecture, the new algorithms are able to explicitly construct cost-to-go solutions for which the Bellman residuals are identically zero at a set of chosen sample states. For this reason, we have named our approach Bellman residual elimination (BRE). Since the Bellman residuals are zero at the sample states, our BRE algorithms can be proven to reduce to exact policy iteration in the limit of sampling the entire state space. Furthermore, by exploiting knowledge of the model, the BRE algorithms eliminate the need to perform trajectory simulations and therefore do not suffer from simulation noise effects. The theoretical basis of our approach is a pair of reproducing kernel Hilbert spaces corresponding to the cost and Bellman residual function spaces, respectively. By constructing an invertible linear mapping between these spaces, we transform the problem of performing BRE into a simple regression problem in the Bellman residual space. Once the solution in the Bellman residual space is known, the corresponding cost function is found using the linear mapping. This theoretical framework allows any kernel-based regression technique to be used to perform BRE. The main algorithmic results of this paper are two BRE algorithms, BRE(SV) and BRE(GP), which are based on support vector regression and Gaussian process regression, respectively. BRE(SV) is presented first as an illustration of the basic idea behind our approach, and this approach is then extended to develop the more sophisticated BRE(GP). BRE(GP) is a particularly useful algorithm, since it can exploit techniques from Gaussian process regression to automatically learn the kernel parameters (via maximization of the marginal likelihood) and provide error bounds on the solution (via the posterior covariance). Experimental results demonstrate that both BRE(SV) and BRE(GP) produce good policies and cost approximations for a classic reinforcement learning problem.
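To make the core mechanism concrete, the following is a minimal sketch of the idea behind BRE for policy evaluation on a small finite MDP with a known model. It is not the paper's BRE(SV) or BRE(GP) machinery; it only illustrates how a kernel-based cost-to-go approximation can be chosen so that the Bellman residual is exactly zero at the sample states, by solving the linear system that the zero-residual condition induces. All names here (P, g, gamma, policy, sample_states, rbf_kernel) are illustrative assumptions, and the solve assumes the resulting matrix is invertible, which is what the paper's nondegenerate-kernel condition is meant to guarantee.

```python
# Sketch only: zero-Bellman-residual policy evaluation on a finite MDP with a
# known model, using a kernel expansion centered at the sample states.
import numpy as np

def rbf_kernel(x, y, length_scale=1.0):
    """Gaussian (RBF) kernel on scalar state indices (an illustrative choice)."""
    return np.exp(-0.5 * ((x - y) / length_scale) ** 2)

def bre_policy_evaluation(P, g, gamma, policy, sample_states, kernel=rbf_kernel):
    """Return a cost-to-go function J whose Bellman residual is zero at sample_states.

    P[a][i, j] : known transition probability i -> j under action a
    g[i, a]    : stage cost of taking action a in state i
    gamma      : discount factor in (0, 1)
    policy[i]  : action chosen by the fixed policy in state i
    """
    n_states = g.shape[0]
    S = list(sample_states)
    m = len(S)

    # Represent J(x) = sum_k lam[k] * kernel(x, S[k]). Requiring the Bellman
    # residual to vanish at each sample state s_i gives the linear system
    #   sum_k lam[k] * [ k(s_i, s_k) - gamma * sum_j P(s_i, j) k(j, s_k) ] = g(s_i, mu(s_i)).
    A = np.zeros((m, m))
    b = np.zeros(m)
    for i, si in enumerate(S):
        a = policy[si]
        b[i] = g[si, a]
        for k, sk in enumerate(S):
            expected_next = sum(P[a][si, j] * kernel(j, sk) for j in range(n_states))
            A[i, k] = kernel(si, sk) - gamma * expected_next

    lam = np.linalg.solve(A, b)  # assumes A is invertible

    def J(x):
        return sum(lam[k] * kernel(x, sk) for k, sk in enumerate(S))

    return J
```

A full approximate policy iteration scheme would alternate this evaluation step with a model-based greedy policy update. In the limit of including every state in sample_states, the solve enforces the Bellman equation at all states, which is consistent with the paper's claim that BRE reduces to exact policy iteration under full sampling.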
Similar Articles
Reinforcement learning with kernels and Gaussian processes
Kernel methods have become popular in many sub-fields of machine learning with the exception of reinforcement learning; they facilitate rich representations, and enable machine learning techniques to work in diverse input spaces. We describe a principled approach to the policy evaluation problem of reinforcement learning. We present temporal difference (TD) learning using kernel functions. Ou...
Adaptive Bases for Reinforcement Learning
We consider the problem of reinforcement learning using function approximation, where the approximating basis can change dynamically while interacting with the environment. A motivation for such an approach is maximizing the value function fitness to the problem faced. Three errors are considered: approximation square error, Bellman residual, and projected Bellman residual. Algorithms under the...
Is the Bellman residual a bad proxy?
This paper aims at theoretically and empirically comparing two standard optimization criteria for Reinforcement Learning: i) maximization of the mean value and ii) minimization of the Bellman residual. For that purpose, we place ourselves in the framework of policy search algorithms, which are usually designed to maximize the mean value, and derive a method that minimizes the residual ‖T∗vπ − vπ...
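For reference, the two criteria being compared can be written as follows. This is a standard formulation reconstructed from the truncated snippet above, so the weighting distributions are assumptions: ν is a state distribution of interest, μ a sampling distribution, T∗ the optimal Bellman operator, and vπ the value function of policy π.

$$
\text{(i)}\ \ \max_{\pi}\ \mathbb{E}_{s \sim \nu}\big[\,v_\pi(s)\,\big]
\qquad\qquad
\text{(ii)}\ \ \min_{\pi}\ \big\| T_{*} v_\pi - v_\pi \big\|_{\mu}
$$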
Boosted Bellman Residual Minimization Handling Expert Demonstrations
This paper addresses the problem of batch Reinforcement Learning with Expert Demonstrations (RLED). In RLED, the goal is to find an optimal policy of a Markov Decision Process (MDP), using a data set of fixed sampled transitions of the MDP as well as a data set of fixed expert demonstrations. This is slightly different from the batch Reinforcement Learning (RL) framework where only fixed sample...
Regularized Policy Iteration
In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a flexible function approximation scheme we propose the use of non-parametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by adding L2-regularization to...
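A generic form of such an L2-regularized policy evaluation objective over an RKHS H is sketched below; the empirical Bellman operator T̂π, the sample set, and the regularization weight λ are notation assumed here, and the paper's actual estimators (for both its Bellman-residual and LSTD variants) include further corrections beyond this basic form.

$$
\hat{Q} \;=\; \arg\min_{Q \in \mathcal{H}}\ \frac{1}{n}\sum_{i=1}^{n}\big( Q(s_i, a_i) - \hat{T}^{\pi} Q(s_i, a_i) \big)^2 \;+\; \lambda\, \| Q \|_{\mathcal{H}}^2
$$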